Langevin Dynamics

Langevin dynamics is a [[Stochastic Differential Equation (SDE)|stochastic differential equation]] (SDE) that describes the motion of a particle under the combined influence of a deterministic drift (the gradient of a potential) and random thermal fluctuations. In machine learning, it serves as a sampling algorithm that generates samples from a target distribution $p(x)$ using only the [[Score Function|score function]] $\nabla_x \log p(x)$. This makes it the foundational sampling mechanism behind score-based generative models and the corrector step in predictor-corrector diffusion samplers.


1. Core Concept

1.1 Physical Origin: Brownian Motion with Drift

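Langevin dynamics originates from statistical physics, where it describes a Brownian particle in a potential field $U(x)$:

$$m\frac{d^2x}{dt^2} = -\nabla U(x) - \gamma\frac{dx}{dt} + \sqrt{2\gamma k_B T}\,\xi(t)$$

where:

  • $m$: particle mass
  • $-\nabla U(x)$: deterministic force from the potential $U$
  • $-\gamma\frac{dx}{dt}$: friction (dissipation) with friction coefficient $\gamma$
  • $\sqrt{2\gamma k_B T}\,\xi(t)$: thermal fluctuations (white noise) at temperature $T$
  • $\xi(t)$: standard Gaussian white noise, $\langle \xi(t)\,\xi(t') \rangle = \delta(t - t')$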

1.2 Overdamped Limit

In the overdamped limit ($m \to 0$: friction dominates and the inertial term is negligible), and in units where $\gamma = 1$ and $k_B T = 1$, the equation reduces to the overdamped Langevin equation:

$$dx_t = -\nabla U(x_t)\,dt + \sqrt{2}\,dW_t$$

This is an [[Stochastic Differential Equation (SDE)|Itô SDE]] with:

  • Drift: $b(x) = -\nabla U(x)$, which follows the negative gradient of the potential
  • Diffusion: $\sigma(x) = \sqrt{2}$, a constant additive noise

1.3 From Physics to Sampling

Replace the physical potential U(x) with the negative log-probability:

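$$U(x) \to -\log p(x) \quad\Longrightarrow\quad -\nabla U(x) = \nabla_x \log p(x)$$

This yields the Langevin sampling equation:

$$dx_t = \nabla_x \log p(x_t)\,dt + \sqrt{2}\,dW_t$$

The stationary distribution of this [[Stochastic Differential Equation (SDE)|SDE]] is exactly $p(x)$: as $t \to \infty$, the law of $x_t$ converges to $p$.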
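The sampler only ever sees $\nabla_x \log p(x)$, so any constant factor in front of $p$ drops out. The following is a minimal sketch of that fact, assuming a 1D Gaussian known only up to a constant `c`; the helper names `log_p_unnormalized` and `numerical_score` are illustrative and not part of any library.

```python
import math

def log_p_unnormalized(x, c=1.0):
    """Log of c * exp(-x^2 / 2): a standard Gaussian up to the constant c."""
    return math.log(c) - 0.5 * x * x

def numerical_score(log_p, x, h=1e-5):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_p(x + h) - log_p(x - h)) / (2 * h)

x = 0.7
score_a = numerical_score(lambda t: log_p_unnormalized(t, c=1.0), x)
score_b = numerical_score(lambda t: log_p_unnormalized(t, c=123.0), x)
# Both estimates agree with the analytic score -x, whatever the constant c is.
```

Because `log(c)` is constant in `x`, it cancels in the finite difference, which is the numerical counterpart of the normalization constant vanishing under the gradient.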


2. Mathematical Foundation

2.1 Stationary Distribution

Theorem: The overdamped Langevin SDE

$$dx_t = \nabla_x \log p(x_t)\,dt + \sqrt{2}\,dW_t$$

has p(x) as its unique stationary distribution under mild regularity conditions.

Proof sketch (via [[Fokker-Planck Equation|Fokker-Planck]]):

The [[Fokker-Planck Equation]] for this [[Stochastic Differential Equation (SDE)|SDE]] is:

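$$\frac{\partial \rho_t(x)}{\partial t} = \nabla \cdot \bigl(-\rho_t \nabla \log p + \nabla \rho_t\bigr)$$

Substituting $\rho_t = p$ into the right-hand side and using $p\,\nabla \log p = \nabla p$:

$$\nabla \cdot \bigl(-p\,\nabla \log p + \nabla p\bigr) = \nabla \cdot \bigl(-\nabla p + \nabla p\bigr) = 0$$

so $\partial \rho_t / \partial t = 0$ at $\rho_t = p$, which means $p$ is stationary.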
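As a concrete instance of the cancellation, here is the same computation for a one-dimensional standard Gaussian target (an illustrative choice, not one made in the text above):

```latex
\begin{aligned}
p(x) &\propto e^{-x^2/2}, \qquad \partial_x \log p(x) = -x, \qquad \partial_x p(x) = -x\,p(x), \\
\partial_x\bigl(-p\,\partial_x \log p + \partial_x p\bigr) &= \partial_x\bigl(x\,p - x\,p\bigr) = 0 .
\end{aligned}
```

The drift term pushes probability mass toward the mode at exactly the rate diffusion spreads it out, so the density does not change.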

2.2 Discrete-Time Approximation (Euler-Maruyama)

The continuous [[Stochastic Differential Equation (SDE)|SDE]] is discretized using the Euler-Maruyama scheme:

$$x_{k+1} = x_k + \eta\,\nabla_x \log p(x_k) + \sqrt{2\eta}\,z_k, \qquad z_k \sim \mathcal{N}(0, I)$$

where $\eta$ is the step size.

```python
import math
import torch

def langevin_dynamics(score_fn, x_init, n_steps, step_size):
    """
    Unadjusted Langevin Algorithm (ULA).

    Args:
        score_fn: Score function ∇_x log p(x)
        x_init: Initial sample
        n_steps: Number of Langevin steps
        step_size: Step size η

    Returns:
        Sample approximately from p(x)
    """
    x = x_init.clone()

    for k in range(n_steps):
        score = score_fn(x)
        noise = torch.randn_like(x)
        # Euler-Maruyama update: drift along the score plus Gaussian noise
        x = x + step_size * score + math.sqrt(2 * step_size) * noise

    return x
```
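The update can be exercised on a target whose answer is known in advance. The sketch below is a dependency-free scalar version of the same rule, assuming a Gaussian target $\mathcal{N}(2, 1)$ with its analytic score; `ula_1d` and `gaussian_score` are illustrative names, and the chain count, step count, and step size are arbitrary choices.

```python
import math
import random

def ula_1d(score_fn, x_init, n_steps, step_size, rng):
    """Scalar unadjusted Langevin chain; returns the final state."""
    x = x_init
    for _ in range(n_steps):
        x = x + step_size * score_fn(x) + math.sqrt(2 * step_size) * rng.gauss(0.0, 1.0)
    return x

def gaussian_score(x, mean=2.0, var=1.0):
    """Score of N(mean, var): d/dx log p(x) = -(x - mean) / var."""
    return -(x - mean) / var

rng = random.Random(0)
# 2000 independent chains, all started far from the target at x = -5
samples = [ula_1d(gaussian_score, x_init=-5.0, n_steps=200, step_size=0.05, rng=rng)
           for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

The empirical mean should land near 2 and the variance near 1, with the variance slightly inflated by the finite step size.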

2.3 Discretization Error

The Euler-Maruyama discretization introduces an $O(\eta)$ error in the stationary distribution. The Metropolis-adjusted Langevin algorithm (MALA) corrects this with an accept-reject step:

| Algorithm | Acronym | Accept/Reject | Bias | Variance |
|---|---|---|---|---|
| Unadjusted Langevin | ULA | ❌ No | $O(\eta)$ | Lower |
| Metropolis-Adjusted Langevin | MALA | ✅ Yes | Asymptotically unbiased | Higher (rejections) |
| Stochastic Gradient Langevin | SGLD | ❌ No | $O(\eta)$ | Lower (scalable) |
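For a Gaussian target the ULA bias can be written in closed form, which makes the $O(\eta)$ scaling concrete. For $p = \mathcal{N}(0, \sigma^2)$ the update is linear, $x_{k+1} = (1 - \eta/\sigma^2)\,x_k + \sqrt{2\eta}\,z_k$, and its stationary variance is $\sigma^2 / (1 - \eta / (2\sigma^2))$ rather than $\sigma^2$ (valid for $\eta < 2\sigma^2$). The sketch below checks this against the exact variance recursion; the function names are illustrative.

```python
def ula_stationary_variance(eta, sigma2):
    """Closed-form stationary variance of ULA targeting N(0, sigma2)."""
    return sigma2 / (1.0 - eta / (2.0 * sigma2))

def iterate_variance(eta, sigma2, n_steps=10_000, v0=0.0):
    """Iterate the exact variance recursion of the linear ULA update."""
    a = 1.0 - eta / sigma2          # contraction factor of the mean-reverting drift
    v = v0
    for _ in range(n_steps):
        v = a * a * v + 2.0 * eta   # Var[a * x + sqrt(2 * eta) * z]
    return v

for eta in (0.2, 0.1, 0.05):
    bias = ula_stationary_variance(eta, 1.0) - 1.0
    print(f"eta={eta:<5} variance bias={bias:.4f}")
```

Halving the step size roughly halves the excess variance, which is the first-order bias that a Metropolis correction removes.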

2.4 Convergence Rate

Under strong log-concavity ($-\log p$ is $\mu$-strongly convex and $L$-smooth), Langevin dynamics converges in Wasserstein-2 distance:

$$W_2(\rho_t, p) \le W_2(\rho_0, p)\,e^{-\mu t} + O\!\left(\frac{L}{\mu}\sqrt{d\,\eta}\right)$$

Key takeaway: convergence is exponentially fast in continuous time, with a discretization error of order $O(\sqrt{\eta d})$.
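The continuous-time part of the bound can be checked exactly in one dimension. Assuming the target $p = \mathcal{N}(0, 1/\mu)$ and a chain started at a fixed point $x_0$ (both choices made for this sketch), the Langevin SDE is an Ornstein-Uhlenbeck process whose law at time $t$ is Gaussian, and the $W_2$ distance between two 1D Gaussians has a closed form.

```python
import math

def w2_to_target(x0, mu, t):
    """Exact W2 distance between the law of x_t and p = N(0, 1/mu),
    for dx = -mu * x dt + sqrt(2) dW started at the point x0."""
    decay = math.exp(-mu * t)
    mean_t = x0 * decay
    std_t = math.sqrt((1.0 - decay ** 2) / mu)
    std_target = math.sqrt(1.0 / mu)
    # W2 between 1D Gaussians: sqrt((mean gap)^2 + (std gap)^2)
    return math.hypot(mean_t, std_t - std_target)

def w2_bound(x0, mu, t):
    """Right-hand side of the continuous-time bound: W2(rho_0, p) * exp(-mu * t)."""
    w2_initial = math.hypot(x0, math.sqrt(1.0 / mu))
    return w2_initial * math.exp(-mu * t)

for t in (0.0, 0.5, 1.0, 2.0):
    print(t, w2_to_target(3.0, 2.0, t), w2_bound(3.0, 2.0, t))
```

In this example the exact distance stays at or below the exponential envelope for every $t$, with equality at $t = 0$.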


3. Langevin Dynamics for Generative Modeling

3.1 Score-Based Sampling

The breakthrough insight of score-based generative modeling:

If we can learn $\nabla_x \log p(x)$, we can sample from $p(x)$ via Langevin dynamics without ever computing the normalization constant.

```text
Score-Based Sampling Pipeline
═══════════════════════════════════════
Data x₀ ~ p_data(x)
        ↓
Learn: s_θ(x) ≈ ∇_x log p_data(x)
        ↓
Sample: x_{k+1} = x_k + η·s_θ(x_k) + √(2η)·z_k
        ↓
x_K ~ p_data(x) (approximate, for large K, small η)
═══════════════════════════════════════
```

3.2 Annealed Langevin Dynamics

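Problem: A single score model struggles with multi-modal, complex distributions: the score is poorly estimated in low-density regions between modes, and plain Langevin dynamics mixes slowly across them.

Solution (NCSN): Train the score model at multiple noise levels $\sigma_1 < \sigma_2 < \dots < \sigma_L$, then anneal from the largest level down to the smallest: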

```python
import math
import torch

def annealed_langevin(score_model, noise_levels, steps_per_level, step_size, x_shape):
    """
    Annealed Langevin dynamics (NCSN sampling).

    Args:
        score_model: Score model s_θ(x, σ)
        noise_levels: [σ_L, ..., σ_1] (largest to smallest)
        steps_per_level: Langevin steps per noise level
        step_size: Base step size η at the smallest noise level (η ∝ σ²)
        x_shape: Shape of the sample batch, e.g. (batch_size, *data_shape)
    """
    x = torch.randn(x_shape)  # Start from noise

    for sigma in noise_levels:  # From large to small noise
        # Adapt step size to noise level: α = η · (σ / σ_1)²
        alpha = step_size * (sigma / noise_levels[-1]) ** 2

        for _ in range(steps_per_level):
            score = score_model(x, sigma)
            noise = torch.randn_like(x)
            x = x + alpha * score + math.sqrt(2 * alpha) * noise

    return x
```

Annealing schedule design:

| Parameter | Typical Value | Rationale |
|---|---|---|
| $\sigma_L$ (max) | 1.0 – 10.0 | Large enough to cover data modes |
| $\sigma_1$ (min) | 0.01 | Small enough for precision |
| $L$ (levels) | 10 – 50 | Geometric progression: $\sigma_{i+1}/\sigma_i = \text{const}$ |
| Steps per level | 10 – 100 | Longer at smaller $\sigma$ for finer detail |
| $\eta$ (step size) | $\eta \propto \sigma^2$ | Ensures stable dynamics at each scale |
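A geometric schedule and the matching per-level step sizes take only a few lines to build. This is a sketch under the conventions used in this section (levels ordered largest first, step size proportional to $\sigma^2$ and equal to a base value at the smallest level); the function names and the numeric values are illustrative.

```python
def geometric_noise_levels(sigma_max, sigma_min, num_levels):
    """Return [sigma_L, ..., sigma_1], largest first, with a constant ratio."""
    if num_levels < 2:
        raise ValueError("need at least two noise levels")
    ratio = (sigma_min / sigma_max) ** (1.0 / (num_levels - 1))
    return [sigma_max * ratio ** i for i in range(num_levels)]

def step_sizes(noise_levels, base_step):
    """Step size per level, eta_i = base_step * (sigma_i / sigma_min)^2."""
    sigma_min = noise_levels[-1]
    return [base_step * (s / sigma_min) ** 2 for s in noise_levels]

levels = geometric_noise_levels(sigma_max=10.0, sigma_min=0.01, num_levels=10)
etas = step_sizes(levels, base_step=1e-5)
```

Scaling the step with $\sigma^2$ keeps the ratio between the noise injected per step and the scale of the smoothed distribution the same at every level.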

3.3 Correctors in Predictor-Corrector Framework

In [[Diffusion Model|diffusion models]], Langevin dynamics serves as the corrector that refines samples:

```text
Predictor-Corrector Sampling Loop
═══════════════════════════════════════
For each timestep t = T → 1:

  1. PREDICTOR: Advance numerical ODE/SDE solver
     x_t → x_{t-1} (via Euler, DPM-Solver, etc.)

  2. CORRECTOR (Langevin): Refine sample using score
     For k = 1 to N_corrector:
       x_{t-1} = x_{t-1} + ε·∇_x log p_{t-1}(x_{t-1}) + √(2ε)·z

═══════════════════════════════════════
```

Why Langevin as corrector?

  • The predictor step may drift away from the true distribution
  • Langevin dynamics, given the exact score, converges toward the correct marginal distribution $p_{t-1}$ at the current noise level
  • A few corrector steps can noticeably improve sample quality
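The loop can be tried end to end on a toy problem where the score is known exactly. The sketch below assumes 1D data distributed as $\mathcal{N}(0, 1)$ and a variance-exploding noising process, so the marginal at noise level $\sigma$ is $\mathcal{N}(0, 1 + \sigma^2)$; the predictor is a reverse-diffusion step between adjacent noise levels, and the corrector step-size rule `step_scale * sigma^2` is a simplification chosen for this example, not a rule prescribed by the text.

```python
import math
import random

DATA_VAR = 1.0  # assumed data distribution: N(0, 1)

def score(x, sigma):
    """Exact score of the noised marginal N(0, DATA_VAR + sigma^2)."""
    return -x / (DATA_VAR + sigma ** 2)

def pc_sample(sigmas, n_corrector, step_scale, rng):
    """sigmas: noise levels, largest first. Returns one sample."""
    x = rng.gauss(0.0, sigmas[0])  # prior: pure noise at the largest level
    for sigma_hi, sigma_lo in zip(sigmas[:-1], sigmas[1:]):
        # Predictor: one reverse-diffusion step from sigma_hi to sigma_lo
        delta = sigma_hi ** 2 - sigma_lo ** 2
        x = x + delta * score(x, sigma_hi) + math.sqrt(delta) * rng.gauss(0.0, 1.0)
        # Corrector: Langevin steps targeting the marginal at sigma_lo
        eps = step_scale * sigma_lo ** 2
        for _ in range(n_corrector):
            x = x + eps * score(x, sigma_lo) + math.sqrt(2 * eps) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
sigmas = [5.0 * (0.1 / 5.0) ** (i / 49) for i in range(50)]  # geometric, 5.0 -> 0.1
samples = [pc_sample(sigmas, n_corrector=1, step_scale=0.1, rng=rng) for _ in range(2000)]
sample_var = sum(s * s for s in samples) / len(samples)
```

The sample variance should come out close to `DATA_VAR + sigmas[-1] ** 2`, the variance of the marginal at the final noise level.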

3.4 Comparison: Langevin vs. ODE vs. [[Stochastic Differential Equation (SDE)|SDE]] Sampling

| Aspect | Langevin Dynamics | ODE (Probability Flow) | Reverse [[Stochastic Differential Equation (SDE)\|SDE]] |
|---|---|---|---|
| Stochasticity | Stochastic | Deterministic | Stochastic |
| Score usage | $\nabla_x \log p(x)$ as drift | $\nabla_x \log p_t(x)$ in ODE drift | $\nabla_x \log p_t(x)$ in [[Stochastic Differential Equation (SDE)\|SDE]] drift |
| Convergence guarantee | Yes ($t \to \infty$) | Path-dependent (fixed start) | Path-dependent |
| Step efficiency | Many steps needed | Fewer steps ([[DPM-Solver]]) | Many steps |
| Quality | High (stochastic refinement) | Good (fast) | High |
| Role | Corrector, standalone sampler | Predictor | Predictor |

4. Algorithmic Variants

4.1 MALA: Metropolis-Adjusted Langevin Algorithm

MALA adds a Metropolis-Hastings accept-reject step to remove discretization bias:

```python
import math
import torch

def mala_step(x, log_prob_fn, score_fn, step_size):
    """One step of the Metropolis-Adjusted Langevin Algorithm (single chain)."""

    def log_q(x_to, x_from):
        # Log-density (up to a constant) of the Langevin proposal q(x_to | x_from)
        mean = x_from + step_size * score_fn(x_from)
        return -((x_to - mean) ** 2).sum() / (4 * step_size)

    # Propose move via Langevin
    noise = torch.randn_like(x)
    x_proposed = x + step_size * score_fn(x) + math.sqrt(2 * step_size) * noise

    # Log-acceptance ratio: requires log p (up to a constant), not just the score.
    # The Langevin proposal is not symmetric, so the correction terms are needed.
    log_ratio = (log_prob_fn(x_proposed) - log_prob_fn(x)
                 + log_q(x, x_proposed) - log_q(x_proposed, x))

    # Accept or reject
    if torch.log(torch.rand(())) < log_ratio:
        return x_proposed, True  # Accept
    return x, False  # Reject
```

4.2 SGLD: Stochastic Gradient Langevin Dynamics

For large datasets, SGLD uses mini-batch gradients:

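$$x_{k+1} = x_k + \eta_k \frac{N}{|B|} \sum_{i \in B} \nabla_x \log p(x_k \mid y_i) + \sqrt{2\eta_k}\,z_k$$

where $N$ is the dataset size, $|B| \ll N$ is the mini-batch size, and $\eta_k \to 0$ with $\sum_k \eta_k = \infty$, $\sum_k \eta_k^2 < \infty$.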

Key properties:

  • Scalable to massive datasets
  • No accept-reject step (unadjusted)
  • Decreasing step size ensures convergence
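A small experiment shows the three properties together. The sketch below assumes a toy problem chosen for this example: inferring the mean $\theta$ of $\mathcal{N}(\theta, 1)$ data under a flat prior, so the posterior is centered at the data mean. The polynomial schedule $\eta_k = a\,(b + k)^{-\gamma}$ with $\gamma \in (0.5, 1]$ satisfies both step-size conditions; the constants `a`, `b`, and `gamma` are arbitrary illustrative values.

```python
import math
import random

def sgld_step_size(k, a=0.005, b=10.0, gamma=0.55):
    """Polynomial decay eta_k = a * (b + k)^(-gamma), with gamma in (0.5, 1]."""
    return a * (b + k) ** (-gamma)

def sgld_gaussian_mean(data, n_steps, batch_size, rng):
    """SGLD chain for the mean theta of N(theta, 1) data under a flat prior."""
    n = len(data)
    theta, trace = 0.0, []
    for k in range(n_steps):
        eta = sgld_step_size(k)
        batch = rng.sample(data, batch_size)
        # Minibatch estimate of the full-data score sum_i (y_i - theta)
        grad = (n / batch_size) * sum(y - theta for y in batch)
        theta = theta + eta * grad + math.sqrt(2 * eta) * rng.gauss(0.0, 1.0)
        trace.append(theta)
    return trace

rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(100)]
trace = sgld_gaussian_mean(data, n_steps=20_000, batch_size=10, rng=rng)
posterior_mean_estimate = sum(trace[500:]) / len(trace[500:])
data_mean = sum(data) / len(data)
```

Each step touches only 10 of the 100 data points, yet the time average of the chain settles near the full-data posterior mean.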

4.3 Underdamped Langevin Dynamics

Reintroducing momentum (kinetic Langevin) for faster mixing:

$$\begin{aligned} dx_t &= v_t\,dt \\ dv_t &= -\gamma v_t\,dt + \nabla_x \log p(x_t)\,dt + \sqrt{2\gamma}\,dW_t \end{aligned}$$

Advantages over overdamped:

  • Faster convergence (momentum reduces random-walk behavior)
  • Better exploration of multi-modal distributions
  • Used in advanced MCMC samplers (Hamiltonian Monte Carlo is a related approach)

```python
import math
import torch

def underdamped_langevin_step(x, v, score_fn, gamma, step_size):
    """One step of underdamped (kinetic) Langevin dynamics."""
    noise = torch.randn_like(v)
    # Velocity update: force from the score, friction, and thermal noise
    v_new = (v + step_size * score_fn(x) - gamma * step_size * v
             + math.sqrt(2 * gamma * step_size) * noise)
    # Position update uses the new velocity (semi-implicit Euler)
    x_new = x + step_size * v_new
    return x_new, v_new
```

5. Connection to Key Concepts

5.1 Langevin Dynamics → [[Score Function]]

Langevin dynamics is the primary consumer of the [[Score Function]] in generative modeling:

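$$\underbrace{x_{k+1} = x_k + \eta\,\nabla_x \log p(x_k) + \sqrt{2\eta}\,z_k}_{\text{Langevin dynamics requires}} \qquad \underbrace{s_\theta(x) \approx \nabla_x \log p(x)}_{\text{Score function provides}}$$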

Without the [[Score Function]], Langevin dynamics cannot sample. Without a sampler such as Langevin dynamics, the learned score produces no samples. This mutual dependency makes them the core pair of score-based generation.

5.2 Langevin Dynamics → [[Diffusion Model]]

In diffusion models, Langevin dynamics appears as:

  1. Corrector step: Refines samples after predictor ODE/[[Stochastic Differential Equation (SDE)|SDE]] steps
  2. Ancestral sampling connection: DDPM reverse process can be viewed as Langevin dynamics with a learned score
  3. Quality boost: Even 1-2 corrector Langevin steps significantly improve FID

5.3 Langevin Dynamics → [[Stochastic Differential Equation (SDE)]]

The overdamped Langevin equation is an Itô SDE:

$$dx_t = \underbrace{\nabla_x \log p(x_t)}_{\text{drift } b(x,t)}\,dt + \underbrace{\sqrt{2}}_{\text{diffusion } \sigma(x,t)}\,dW_t$$

This connects to the general [[Stochastic Differential Equation (SDE)|SDE]] framework used in score-based generative models (Song et al., 2021), where different choices of b and σ define different diffusion processes.

5.4 Langevin Dynamics → [[Fokker-Planck Equation]]

The [[Fokker-Planck Equation]] provides the density-level description of Langevin dynamics:

$$\partial_t \rho_t = -\nabla \cdot (\rho_t \nabla \log p) + \Delta \rho_t$$

This PDE describes how the ensemble distribution evolves — and proves that p(x) is the stationary solution.

5.5 Langevin Dynamics → [[Wiener Process|Wiener Process]]

The noise term $dW_t$ in Langevin dynamics is the increment of a [[Wiener Process|Wiener Process]], the continuous-time limit of the discrete Gaussian noise $z_k \sim \mathcal{N}(0, I)$. Without this noise term, Langevin dynamics would collapse to deterministic gradient ascent on $\log p$ and fail to explore the distribution.
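The collapse is easy to see by deleting the noise term from the update. The sketch below assumes a Gaussian target $\mathcal{N}(2, 1)$ and uses illustrative function names; with the Wiener increment removed, the rule is plain gradient ascent on $\log p$.

```python
def noise_free_langevin(score_fn, x_init, n_steps, step_size):
    """Langevin update with the Wiener increment removed: gradient ascent on log p."""
    x = x_init
    for _ in range(n_steps):
        x = x + step_size * score_fn(x)
    return x

def gaussian_score(x, mean=2.0):
    """Score of N(mean, 1)."""
    return -(x - mean)

final_states = [noise_free_langevin(gaussian_score, x0, n_steps=500, step_size=0.1)
                for x0 in (-5.0, 0.0, 7.0)]
# Every chain ends at the mode x = 2, so the "samples" have zero spread.
```

All starting points converge to the same mode, so the output carries no information about the width of the distribution.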


6. Practical Implementation

6.1 Complete Langevin Sampler

```python
import math
import torch

class LangevinSampler:
    """Complete Langevin dynamics sampler with diagnostics."""

    def __init__(self, score_fn, step_size=1e-3, steps=100,
                 noise_annealing=True, clipping=1e3):
        self.score_fn = score_fn
        self.step_size = step_size
        self.steps = steps
        self.noise_annealing = noise_annealing
        self.clipping = clipping

    def sample(self, x_init, return_trajectory=False):
        x = x_init.clone()
        trajectory = [x.clone()] if return_trajectory else None

        for k in range(self.steps):
            score = self.score_fn(x)

            # Element-wise clipping of the score for stability
            score = torch.clamp(score, -self.clipping, self.clipping)

            # Optional: linearly anneal the step size (and hence the noise scale)
            if self.noise_annealing:
                eta = self.step_size * (1 - k / self.steps)
            else:
                eta = self.step_size

            noise = torch.randn_like(x)
            x = x + eta * score + math.sqrt(2 * eta) * noise

            if return_trajectory:
                trajectory.append(x.clone())

        return (x, trajectory) if return_trajectory else x

    def sample_multiple(self, n_samples, x_shape, device='cpu'):
        """Generate multiple independent samples."""
        x = torch.randn(n_samples, *x_shape, device=device)
        return self.sample(x)
```

6.2 Step Size Tuning

| Symptom | Likely Cause | Fix |
|---|---|---|
| Diverging samples | Step size too large | Reduce $\eta$, add gradient clipping |
| No mixing (stuck) | Step size too small | Increase $\eta$, add momentum |
| Mode collapse | Insufficient noise | Use annealing schedule |
| High autocorrelation | Underdamped needed | Add momentum (kinetic Langevin) |
| Numerical instability | Poor score estimate | Gradient clipping, check score model |

6.3 Computational Complexity

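For $d$-dimensional data and $K$ steps:

| Component | Cost per step | Total cost |
|---|---|---|
| Score evaluation | $O(d_{\text{params}})$ | $K \times$ score cost |
| Noise generation | $O(d)$ | $O(Kd)$ |
| State update | $O(d)$ | $O(Kd)$ |
| Total | | $O(K \cdot d_{\text{params}})$ |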

The dominant cost is score model evaluation — in diffusion models, the [[U-Net]]/[[DiT]] forward pass for each Langevin corrector step.


7. Theoretical Properties

7.1 Reversibility and Detailed Balance

The overdamped Langevin [[Stochastic Differential Equation (SDE)|SDE]] is reversible with respect to $p(x)$. This means the process satisfies detailed balance:

$$p(x)\,T(x \to y) = p(y)\,T(y \to x)$$

where T is the transition kernel. Reversibility ensures that p is indeed the invariant measure.
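Detailed balance holds for the continuous-time process. A single Euler-Maruyama step does not satisfy it exactly, which is why MALA adds a Metropolis-Hastings correction. The sketch below checks the identity numerically for the accepted-move part of the MALA kernel, assuming a standard Gaussian target known up to a constant; the function names are illustrative.

```python
import math

def log_p(x):
    """Unnormalized log-density of a standard Gaussian target."""
    return -0.5 * x * x

def score(x):
    """Score of the standard Gaussian: d/dx log p(x) = -x."""
    return -x

def log_q(y, x, eta):
    """Log-density of the Langevin proposal y ~ N(x + eta * score(x), 2 * eta)."""
    mean = x + eta * score(x)
    return -((y - mean) ** 2) / (4 * eta) - 0.5 * math.log(4 * math.pi * eta)

def mala_flow(x, y, eta):
    """p(x) * T(x -> y) for an accepted MALA move with x != y (p unnormalized)."""
    log_accept = min(0.0, log_p(y) + log_q(x, y, eta) - log_p(x) - log_q(y, x, eta))
    return math.exp(log_p(x) + log_q(y, x, eta) + log_accept)

forward = mala_flow(0.3, 1.2, eta=0.1)
backward = mala_flow(1.2, 0.3, eta=0.1)
```

The two flows agree to floating-point precision because the acceptance probability reduces each one to the smaller of $p(x)\,q(y \mid x)$ and $p(y)\,q(x \mid y)$.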

7.2 Ergodicity

Under mild conditions (positive density, smooth score, proper tails), Langevin dynamics is ergodic:

$$\lim_{T \to \infty} \frac{1}{T}\int_0^T f(x_t)\,dt = \mathbb{E}_{x \sim p}[f(x)] \quad \text{a.s.}$$

This guarantees that time averages converge to ensemble averages — a crucial property for MCMC applications.
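The statement can be illustrated with a single long chain instead of many independent ones. The sketch below assumes a standard Gaussian target and the test function $f(x) = x^2$, whose expectation under $p$ is 1; it uses the discretized (ULA) chain, so the time average carries a small step-size bias on top of Monte Carlo error.

```python
import math
import random

def time_average(f, n_steps, step_size, rng, x0=0.0, burn_in=1000):
    """Average f along one ULA chain targeting N(0, 1), whose score is -x."""
    x, total = x0, 0.0
    for k in range(n_steps + burn_in):
        x = x - step_size * x + math.sqrt(2 * step_size) * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            total += f(x)
    return total / n_steps

rng = random.Random(0)
second_moment = time_average(lambda x: x * x, n_steps=200_000, step_size=0.05, rng=rng)
```

One trajectory is enough here because successive states decorrelate after a few dozen steps, so 200,000 steps contain thousands of effectively independent draws.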

7.3 Mixing Time

The mixing time (time to get $\epsilon$-close in total variation) for a log-concave target:

$$\tau_{\text{mix}}(\epsilon) = O\!\left(\frac{1}{\mu}\log\frac{d}{\epsilon}\right)$$

where $\mu$ is the strong convexity constant. Non-log-concave targets can have exponentially worse mixing.


8. Comparison with Other Sampling Methods

| Method | Gradient | Stochastic | Acceptance | Scaling | Best For |
|---|---|---|---|---|---|
| Langevin (ULA) | Score only | Yes | No | $O(Kd^2)$ | Continuous, differentiable |
| MALA | Score + log-p | Yes | Yes | $O(Kd^2)$ | Exact sampling, high-dim |
| HMC | Score + log-p | Yes (implicit) | Yes | $O(Kd^2)$ | Multi-modal, correlated |
| Gibbs | None | Conditional | No (always accepts) | $O(Kd)$ | Factorized conditionals |
| RW Metropolis | None | Yes | Yes | $O(Kd^2)$ | Low-dim, non-diff. |
| Rejection | None | Yes | Yes | $O(\exp(d))$ | Low-dim only |

Langevin advantage: It only needs $\nabla_x \log p(x)$, with no normalization constant and, for ULA, no accept-reject step. This makes it well suited to score-based deep generative models.


9. Core Formula Cards

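| # | Formula | Meaning |
|---|---|---|
| 1 | $dx_t = \nabla_x \log p(x_t)\,dt + \sqrt{2}\,dW_t$ | Overdamped Langevin [[Stochastic Differential Equation (SDE)\|SDE]] (continuous) |
| 2 | $x_{k+1} = x_k + \eta\,\nabla_x \log p(x_k) + \sqrt{2\eta}\,z_k$ | Euler-Maruyama discretization (ULA) |
| 3 | $\partial_t \rho_t = -\nabla \cdot (\rho_t \nabla \log p) + \Delta \rho_t$ | [[Fokker-Planck Equation]] for Langevin |
| 4 | $W_2(\rho_t, p) \le W_2(\rho_0, p)\,e^{-\mu t} + O(\sqrt{\eta d})$ | Convergence rate (log-concave) |
| 5 | $\langle \xi(t)\,\xi(t') \rangle = \delta(t - t')$ | White noise correlation ([[Wiener Process]]) |
| 6 | $U(x) \leftrightarrow -\log p(x)$ | Physical potential ↔ probability connection |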

10. Summary

Langevin dynamics bridges statistical physics and deep generative modeling through a simple yet profound connection: the physical force $-\nabla U(x)$ becomes the [[Score Function]] $\nabla_x \log p(x)$, and thermal noise becomes the exploration mechanism that ensures proper sampling.

Its three key roles in modern ML:

| Role | Context | Significance |
|---|---|---|
| Standalone sampler | Score-based models (NCSN) | Generates samples from learned score without normalization |
| Corrector | Predictor-corrector diffusion | Refines samples, improves quality with 1-2 steps |
| Theoretical bridge | [[Stochastic Differential Equation (SDE)\|SDE]] ↔ Density evolution | Links particle trajectories ([[Stochastic Differential Equation (SDE)\|SDE]]) to distribution evolution (Fokker-Planck) |

The equation itself is deceptively simple, $dx = \nabla \log p\,dt + \sqrt{2}\,dW$, yet it unifies MCMC sampling, score-based generation, and nonequilibrium statistical mechanics under one framework.


  • [[Score Function]]
  • [[Diffusion Model]]
  • [[Stochastic Differential Equation (SDE)]]
  • [[Fokker-Planck Equation]]
  • [[Wiener Process|Wiener Process]]
  • [[Probability Flow ODE]]
  • [[DDIM]]
  • [[DPM-Solver]]
  • [[Markov Process]]
  • [[Martingale]]
  • [[Metropolis-Hastings]]
  • [[Hamiltonian Monte Carlo]]